Introduction

The objective of this project is to build a knowledge-graph-based recommender system. The database contains fields from candidates' resumes, with information about their skills, education institute, level of education and so on. From this information we create four fields for the knowledge graph: 'University', 'Degree Type', 'Degree Level' and 'Skills'. Once these fields are processed and fed into the graph, a similarity matrix is prepared, and the recommender system can then suggest the top n matches for a given candidate ID. Pre-processing of the dataset is done with the pandas and spaCy libraries, the graph is constructed with networkx, and the similarity between candidates is calculated with the SimRank algorithm. At the end we can also manually inspect how similar the recommended candidates are to the given candidate.

In [1]:
import timeit
start_time = timeit.default_timer()
In [2]:
import sys
import os
import numpy as np
import pandas as pd
import json
import spacy
import networkx as nx
from spacy.lang.en import English
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
os.getcwd()
Out[3]:
'C:\\Users\\saurabh\\Desktop\\Knowledge Graph'
In [4]:
sys.version
Out[4]:
'3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]'
In [5]:
os.chdir('C:/Users/saurabh/Desktop/Knowledge Graph')
In [6]:
dataframe= pd.read_json("Filtered01.json")
In [7]:
df = pd.read_json(dataframe['structuredLayout'].to_json(), orient="index")

Data pre-processing

In [8]:
df['Details']=df.Details.astype(str)
In [9]:
df.drop(['Extracurricular','Interests', 'Profile', 'Reference', 'Skills','Experience'], inplace=True, axis = 1)

Extracting University Name

In [10]:
df['University'] = dataframe['universties']
df['Skills'] = dataframe['skillsCluster']
df['Degree'] = dataframe['degrees']

Copy the university column along with the index into a new dataframe

In [11]:
uni_list = []
for index, uni in df['University'].iteritems():
    for u in uni.keys():
        temp = [index, u]
        uni_list.append(temp)

df_uni = pd.DataFrame(uni_list, columns = ['index', 'University'])    
In [12]:
df.drop(['University'], inplace=True, axis = 1)

In the original dataframe the universities are already sorted by ranking, and we keep only the first value (the highest-ranked university) per candidate. So we create a new column 'Match', which is True if the value in the index column repeats (i.e. equals the previous row's index) and False otherwise.

In [13]:
df_uni["Match"]= df_uni["index"] == df_uni.shift()["index"]

Drop all rows where 'Match' is True

In [14]:
df_uni.drop(df_uni[df_uni.Match == True].index, inplace=True)   
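For reference, the Match/shift pattern above is equivalent to pandas' drop_duplicates with keep='first'; a minimal sketch on a toy frame (column values are illustrative):

```python
import pandas as pd

# Toy version of df_uni: universities per candidate index, already sorted
# by ranking, so the first row per index is the highest-ranked one.
df_uni_demo = pd.DataFrame(
    [[0, "Uni A"], [0, "Uni B"], [1, "Uni C"]],
    columns=["index", "University"],
)

# Keep only the first (highest-ranked) university per candidate index.
first_per_index = df_uni_demo.drop_duplicates(subset="index", keep="first")
```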

Join the universities dataframe back onto the original dataframe: df's index already holds the candidate ID, so df_uni is set to use its 'index' column as its index and the two frames are joined on it; the 'index' column itself is not needed afterwards.

In [15]:
# df's index already holds the candidate ID, so the join can be done directly on it
df_2 = df.join(df_uni.set_index('index'))

Drop the helper column 'Match', now that the highest-ranked university has been extracted from each candidate's profile

In [16]:
df_2.drop(['Match'], inplace= True, axis = 1)

Extracting Degree Type

The following code extracts the degree level and the degree type of each candidate. Worth mentioning: the first two words of each degree string have already been extracted, so the following for loop stores that extracted part along with the index. This dataframe will be combined with the main dataframe at the end, before it is passed to the graph.

In [17]:
degrees = []
for i,j in df_2['Degree'].iteritems():
    for k,l in j:
        temp = [i,l, k]
        degrees.append(temp)        
df_deg = pd.DataFrame(degrees, columns = ['index','type', 'level'])
In [18]:
# Clean the text
import re

def clean_text(text):
    text = text.replace('\n', ' ')                  # replace newlines with spaces
    text = text.replace('/', ' ')                   # replace forward slashes with spaces
    text = re.sub(r'[^a-zA-Z0-9 ]', '', str(text))  # keep letters, digits and spaces only
    text = text.lower()                             # lower case
    text = re.sub(r'x.[0-9]', '', text)             # drop escape residue such as 'xa0'
    return text
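As a quick sanity check, the cleaning steps can be exercised on a sample string (the function is restated here so the snippet runs on its own; the sample degree string is made up):

```python
import re

def clean_text(text):
    text = text.replace('\n', ' ')                  # replace newlines with spaces
    text = text.replace('/', ' ')                   # replace forward slashes with spaces
    text = re.sub(r'[^a-zA-Z0-9 ]', '', str(text))  # keep letters, digits and spaces only
    text = text.lower()                             # lower case
    text = re.sub(r'x.[0-9]', '', text)             # drop escape residue such as 'xa0'
    return text

cleaned = clean_text("Bachelor of Science\n(Hons) / 2016")
```

Punctuation and newlines are stripped and the text is lower-cased, leaving only letters, digits and spaces for the tokenizer.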

Apply the function clean_text() to the column 'type'

In [19]:
df_deg['type'] = df_deg.apply(lambda x: clean_text(x['type']), axis=1)

In the following part, each string is separated into individual tokens so that the type of the degree can be extracted

In [20]:
# Initialize the tokenizer
from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_sm")
tokenizer = Tokenizer(nlp.vocab)

The for loop reads the data points row-wise, checks that each token is not empty and stores the extracted tokens in a list called 'tokens'

In [21]:
tokens = []
for doc in tokenizer.pipe(df_deg['type'], batch_size=500):
    doc_tokens = []
    for token in doc:
        if (token.text != ' '):
            doc_tokens.append(token.text)
    tokens.append(doc_tokens)

Create a new column with the stored tokens

In [22]:
df_deg['token_type'] = tokens
In [23]:
df_deg['token_type'].head(20)
Out[23]:
0     [bachelor, of, computer, science, engineering,...
1     [master, of, information, technology, universi...
2     [bachelors, in, architecture, 5, years, course...
3     [diploma, in, management, 2016, msc, of, compu...
4     [bachelor, of, science, in, electrical, engine...
5     [masters, of, it, combined, up, to, 2, years, ...
6     [master, of, it, machine, learning, software, ...
7     [master, of, science, in, electrical, engineer...
8     [master, in, business, administration, univers...
9     [master, of, analytics, concentration, data, h...
10    [master, of, data, science, monash, university...
11    [bachelor, of, science, in, computer, science,...
12    [bachelor, of, arts, in, russian, yangon, univ...
13    [diploma, in, association, of, business, excec...
14    [masters, in, computer, application, pg, diplo...
15    [diploma, in, advance, computing, from, cdac, ...
16    [diploma, in, advanced, computing, with, mba, ...
17    [bachelor, of, business, administration, altiu...
18    [master, of, business, analytics, deakin, univ...
19    [diploma, in, digital, marketing, niit, seo, s...
Name: token_type, dtype: object

In order to extract tokens efficiently, it is important to select indexes in a meaningful way. From the cell output above it is hard to find a single rule, but positions 2 and 3 give the type of the degree in most cases. It is also important not to extract too many tokens, since each degree type will later act as an individual node in the graph.

In [24]:
df_deg['token_type_selected'] = [i[2:4] for i in df_deg['token_type']] 

After extracting the tokens at these indexes, a few tokens remain that are not meaningful, and the degree type needs to be standardized. The following list contains all the potential degree types; it will be intersected with the extracted tokens, and a new column is created from the matches.

In [25]:
# Degree type list
degree_type = ['computer', 'data', 'science', 'information', 'technology', 'architecture',
               'management', 'electrical', 'business', 'administration', 'engineering',
               'analytics', 'application', 'computing', 'digital', 'marketing',
               'food', 'beverage', 'chemistry', 'health', 'statistics',
               'analysis', 'mechanical', 'accounting', 'mathematics', 'electronics',
               'telecommunication', 'property', 'marine', 'chemical',
               'construction', 'arts', 'law', 'legal', 'network', 'media', 'security',
               'education', 'project', 'system', 'anthropology', 'sociology', 'design',
               'aviation', 'state', 'economics', 'physics', 'industrial', 'human',
               'archinformation', 'commerce', 'psychology', 'software', 'translation']
In [26]:
df_deg['type'] = df_deg.apply(lambda x: list(set(x['token_type_selected']) & set(degree_type)), axis=1)

The newly created column holds a list in each row; the following code joins each list into a single string, which is required to treat each degree type as an individual node

In [27]:
df_deg['type'] = [' '.join(str(x) for x in i) for i in df_deg['type']]     
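On a single row, the standardization above amounts to a set intersection followed by a join, sketched here on toy values (note that iteration order over a set is arbitrary, which is why reversed labels such as 'science computer' can appear and are remapped later):

```python
# Toy vocabulary and selected tokens (illustrative subset of degree_type).
degree_type = {"computer", "science", "management"}
token_type_selected = ["computer", "science", "monash"]

# Keep only tokens that appear in the vocabulary, then join into one label.
matched = list(set(token_type_selected) & degree_type)
as_node = " ".join(sorted(matched))  # sorted here only to make the label deterministic
```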

Extracting Degree Level

In the degree dataframe there are several spellings of the same degree level: bachelor appears as 'bachelors in', 'bachelors of', 'bachelors' etc., and similarly for masters and diploma. The following function therefore looks for the substrings 'bac', 'mas' and 'dip' and maps the string to 'bachelor', 'master' or 'diploma' respectively, and to None in all other cases. Before applying the function, the column is converted to string so that string operations (matching) can be performed.

In [28]:
def process_degree(s):
    if 'bac' in s:
        s = s[:s.rindex('bac')] + 'bachelor'
    elif 'mas' in s:
        s = s[:s.rindex('mas')] + 'master'
    elif 'dip' in s:
        s = s[:s.rindex('dip')] + 'diploma'
    else:
        s = None
    return s
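A few examples of what the mapping produces (the inputs are made up, and the function is restated so the snippet is self-contained):

```python
def process_degree(s):
    # Keep the text before the last 'bac'/'mas'/'dip' occurrence and append
    # the canonical level; unrelated strings map to None.
    if 'bac' in s:
        s = s[:s.rindex('bac')] + 'bachelor'
    elif 'mas' in s:
        s = s[:s.rindex('mas')] + 'master'
    elif 'dip' in s:
        s = s[:s.rindex('dip')] + 'diploma'
    else:
        s = None
    return s

levels = [process_degree(x) for x in ("bachelors of", "masters in", "diploma", "phd")]
```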
In [29]:
df_deg['level'] = df_deg['level'].apply(str)    
df_deg['level'] = df_deg.apply(lambda x: process_degree(x['level']), axis=1)

Finally drop the unnecessary columns

In [30]:
df_2.drop(['Degree'], inplace= True, axis = 1)
df_2.drop(['Education'], inplace=True, axis = 1)
df_deg.drop(['token_type'], inplace= True, axis = 1)
df_deg.drop(['token_type_selected'], inplace= True, axis = 1)

Extracting Skills

The following list contains all the potential technical skills; it will be intersected with each candidate's extracted skills, and only the matches are kept in a new column

In [31]:
# Tech terms list
tech_terms = ['python', 'r', 'sql', 'hadoop', 'spark', 'java', 'sas', 'tableau','mysql',
              'hive', 'scala', 'aws', 'c', 'c++', 'matlab', 'tensorflow', 'excel','angular',
              'nosql', 'linux', 'azure', 'scikit', 'machine learning', 'statistic',
              'analysis', 'computer science', 'visual', 'ai','artificial intelligence', 'deep learning','mongodb',
              'nlp', 'natural language processing', 'neural network', 'mathematic',
              'database', 'oop', 'blockchain','cloud', 'bootstrap', 'unix','agile',
              'html', 'css', 'javascript', 'jquery', 'git', 'photoshop', 'illustrator',
              'word press', 'seo', 'responsive design', 'php', 'mobile', 'design', 'react',
              'security', 'ruby', 'fireworks', 'json', 'node', 'express', 'redux', 'ajax',
              'java', 'api','ios','big data','php','adobe','assembly','wireframe','couchdb', 
              'ui prototype', 'ux writing', 'interactive design','iot','ruby on rails',
              'metric', 'analytic', 'ux research', 'mockup', 'c#','web development',
              'prototype', 'test', 'ideate', 'usability', 'high-fidelity design', 'karma',
              'framework','testing', 'xml','oracle','node.js','scrum','uml','database management',
              'autocad','swift', 'xcode', 'spatial reasoning', 'human interface', 'core data',
              'grand central', 'network', 'objective-c', 'foundation', 'uikit', 'asp.net',
              'cocoatouch', 'spritekit', 'scenekit', 'opengl', 'metal','data engineering',
              'dreamweaver','statistical analysis','coding','basic','logic','docker',
              'ms access','computer vision','html5','sed','abap']
In [32]:
df_2['Skills'] = df_2.apply(lambda x: list(set(x['Skills']) & set(tech_terms)), axis=1)

The explode function takes each item of a list-valued cell and places it in its own row, keeping the same index. This operation is required because, when the dataframe is passed to the graph, each skill should become a separate node.

In [33]:
df_final = df_2.explode('Skills')
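The behaviour of explode can be seen on a toy frame (values are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({"Details": ["1315"], "Skills": [["sql", "java"]]})

# One row per skill; the index and the other columns are repeated.
exploded = demo.explode("Skills")
```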
In [34]:
df_final['Edge'] =  ['edge'] * len(df_final) 

Reset the index to create an 'index' column, which will be used to merge the degree dataframe with the final dataframe

In [35]:
df_final.reset_index(inplace=True)

Merging all the processed fields

The degree dataframe is merged into the final dataframe with a left join: we keep every row of the final dataframe and only attach matching degree rows. With the default inner join we would get only the rows present in the degree dataframe and lose the rest.

In [36]:
df_final = df_final.merge(df_deg, on='index', how='left')
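A minimal sketch of why how='left' matters here (toy frames; a default inner join would drop left-hand rows whose key has no match):

```python
import pandas as pd

skills = pd.DataFrame({"index": [0, 1, 2], "Skills": ["sql", "java", "aws"]})
degrees = pd.DataFrame({"index": [0, 1], "level": ["bachelor", "master"]})

# Left join: all three skill rows survive; index 2 gets level == NaN.
merged = skills.merge(degrees, on="index", how="left")
```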

In order to enhance the readability of the nodes, a few of the strings are replaced with more meaningful versions (iteration order over a set is arbitrary, so reversed forms such as 'science computer' appear)

In [37]:
df_final['type'] = df_final['type'].replace ({'science computer':'computer science', 
'technology information':'information technology', 'system computer':'computer system',
'application computer':'computer application', 'management project':'project management',
'technology electronics':'electronics technology', 'science electrical':'electrical science', 
'science data':'data science', 'technology computer':'computer technology', 
'science information':'information science','science computing':'computing science'})

While processing the degree type, an empty string is introduced wherever there is no match; these lead to NaN values, which need to be dropped.

In [38]:
df_final['type'] = df_final['type'].replace('', np.nan, regex=True)

Drop the helper 'index' column and the rows with missing values

In [39]:
df_final.drop(['index'], inplace=True, axis = 1)
df_final.dropna(inplace=True)

Constructing Knowledge Graph

To build the knowledge graph over the Skills, University and Degree columns, we first build individual graphs for skills and university and combine them, then combine the degree-level and degree-type graphs, and finally merge the two combined graphs so that the final graph contains all the fields.
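A small sketch of nx.compose, which is what links the per-field graphs through their shared candidate nodes (node names are made up):

```python
import networkx as nx

g_skill = nx.DiGraph()
g_skill.add_edge("python", "1315")          # skill -> candidate

g_uni = nx.DiGraph()
g_uni.add_edge("Some University", "1315")   # university -> same candidate

# compose takes the union of nodes and edges; the shared candidate node
# "1315" is merged, connecting the two fields in one graph.
combined = nx.compose(g_skill, g_uni)
```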

Knowledge Graph for Skills

In [40]:
kg_df = pd.DataFrame({'source' : df_final['Skills'], 'target': df_final['Details'], 'edge':df_final['Edge']})
G_skills = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr = True, create_using = nx.DiGraph())
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_skills)
nx.draw(G_skills, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show() 
fig.savefig('skills_KG.png') 

Knowledge Graph for University

In [41]:
kg_df = pd.DataFrame({'target' : df_final['Details'], 'source': df_final['University'], 'edge':df_final['Edge']})
G_uni = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr = True, create_using = nx.DiGraph())
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_uni)
nx.draw(G_uni, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show() 
fig.savefig('uni_KG.png') 

Knowledge Graph for Degree Level

In [42]:
kg_df = pd.DataFrame({'source' : df_final['level'], 'target': df_final['Details'], 'edge':df_final['Edge']})
G_degree_level = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr = True, create_using = nx.DiGraph())
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_degree_level)
nx.draw(G_degree_level, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
fig.savefig('degree_KG.png') 

Knowledge Graph for Degree Type

In [43]:
kg_df = pd.DataFrame({'source' : df_final['type'], 'target': df_final['Details'], 'edge':df_final['Edge']})
G_degree_type = nx.from_pandas_edgelist(kg_df, "source", "target", edge_attr = True, create_using = nx.DiGraph())
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_degree_type)
nx.draw(G_degree_type, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show()
fig.savefig('degree_field KG.png') 

In order to construct the final graph, we first combine the skills and university graphs, then combine the degree-level and degree-type graphs, and finally compose the two combined graphs

In [44]:
G_combined = nx.compose(G_skills,G_uni)
In [45]:
G_combined_2 = nx.compose(G_degree_level,G_degree_type)

Final Knowledge Graph

In [46]:
G_final = nx.compose(G_combined,G_combined_2)
fig, ax = plt.subplots(figsize=(30, 40), dpi=80)
pos = nx.spring_layout(G_final)
nx.draw(G_final, with_labels=True,node_size= 4500, node_color= 'skyblue', edge_cmap=plt.cm.Blues, pos = pos)
plt.show() 
fig.savefig('Final_KG.png') 

Recommendation

To calculate the similarity, we will use the SimRank algorithm, whose core intuition is that "two objects are considered to be similar if they are referenced by similar objects."
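The intuition can be checked on a toy graph: two candidate nodes referenced by the same skill node come out similar (a minimal sketch using networkx's simrank_similarity; node names are made up, and exact scores depend on networkx's default parameters):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("sql", "A"), ("sql", "B")])  # one skill pointing at two candidates

sim = nx.simrank_similarity(G)
# sim["A"]["A"] is 1 by definition; sim["A"]["B"] is positive because
# A and B share the in-neighbour "sql".
```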

In [47]:
sim_final = nx.similarity.simrank_similarity(G_final)
In [48]:
from heapq import nlargest

def find_similarity_final(key, graph):
    # nlargest(4) because the node most similar to `key` is always `key`
    # itself, which is dropped; scores come from the precomputed sim_final.
    if key in sim_final:
        top3 = nlargest(4, sim_final[key], key=sim_final[key].get)
        return top3[1:]
    else:
        return 'key does not exist'

Model Performance by Manual Inspection

In [49]:
find_similarity_final('1315', G_final)
Out[49]:
['751', '1884', '24']

Similarity scores of the recommended candidates with the given candidate

In [50]:
print(sim_final["1315"]["751"])
print(sim_final["1315"]["1884"])
print(sim_final["1315"]["24"])
0.035
0.03
0.027777777777777776
In [51]:
pd.options.display.max_colwidth = 100
In [52]:
df_final['Degree_level']=df_final['level']
df_final['Degree_type']=df_final['type']
df_final.drop(['level','type'], inplace=True, axis = 1)
In [53]:
df_final[df_final['Details'] == '1315']
Out[53]:
Details Skills University Edge Degree_level Degree_type
555 1315 unix University of Melbourne edge bachelor science
556 1315 ms access University of Melbourne edge bachelor science
557 1315 asp.net University of Melbourne edge bachelor science
558 1315 visual University of Melbourne edge bachelor science
559 1315 azure University of Melbourne edge bachelor science
560 1315 mysql University of Melbourne edge bachelor science
561 1315 jquery University of Melbourne edge bachelor science
562 1315 basic University of Melbourne edge bachelor science
563 1315 aws University of Melbourne edge bachelor science
564 1315 cloud University of Melbourne edge bachelor science
565 1315 oracle University of Melbourne edge bachelor science
566 1315 java University of Melbourne edge bachelor science
567 1315 c# University of Melbourne edge bachelor science
568 1315 agile University of Melbourne edge bachelor science
569 1315 sql University of Melbourne edge bachelor science
In [54]:
df_final[df_final['Details'] == '751']
Out[54]:
Details Skills University Edge Degree_level Degree_type
96 751 visual University of the Philippines edge bachelor science
97 751 mysql University of the Philippines edge bachelor science
98 751 testing University of the Philippines edge bachelor science
99 751 c++ University of the Philippines edge bachelor science
100 751 java University of the Philippines edge bachelor science
101 751 agile University of the Philippines edge bachelor science
102 751 c# University of the Philippines edge bachelor science
In [55]:
df_final[df_final['Details'] == '1884']
Out[55]:
Details Skills University Edge Degree_level Degree_type
612 1884 mysql University of Melbourne edge bachelor chemical
614 1884 aws University of Melbourne edge bachelor chemical
616 1884 cloud University of Melbourne edge bachelor chemical
618 1884 python University of Melbourne edge bachelor chemical
620 1884 iot University of Melbourne edge bachelor chemical
622 1884 autocad University of Melbourne edge bachelor chemical
624 1884 sql University of Melbourne edge bachelor chemical
In [56]:
df_final[df_final['Details'] == '24']
Out[56]:
Details Skills University Edge Degree_level Degree_type
582 24 asp.net Ajou University edge bachelor science
584 24 testing Ajou University edge bachelor science
586 24 aws Ajou University edge bachelor science
588 24 cloud Ajou University edge bachelor science
590 24 iot Ajou University edge bachelor science
592 24 c Ajou University edge bachelor science

Final Words

It can be observed that the recommended candidates 751, 1884 and 24 have many commonalities with the given candidate 1315, so the model performs satisfactorily. The project certainly has limitations, however. First of all, other attributes such as interests and certifications could still be explored, which would likely improve the quality of the recommendations. Using a knowledge graph for recommendation has the potential to exploit data points that might otherwise remain unused or unseen, so exploring these attributes in the future could be beneficial.

Time taken for the complete model to run

In [57]:
elapsed = timeit.default_timer() - start_time
elapsed = "{:.2f}".format(elapsed)
print(elapsed + ' seconds')
43.41 seconds